7,066 research outputs found

    From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning

    Full text link
    Video captioning is in essence a complex natural process, affected by various uncertainties stemming from the video content, subjective judgment, etc. In this paper we build on recent progress in using the encoder-decoder framework for video captioning and address what we find to be a critical deficiency of existing methods: most decoders propagate deterministic hidden states, and such complex uncertainty cannot be modeled efficiently by deterministic models. We propose a generative approach, referred to as the multi-modal stochastic RNN network (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. MS-RNN can therefore improve the performance of video captioning and generate multiple sentences describing a video under different random factors. Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with both visual and textual features and capture a high-level representation. Then, a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging MSVD and MSR-VTT datasets show that the proposed MS-RNN approach outperforms state-of-the-art video captioning methods.
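    The core idea described in the abstract (a decoder whose hidden state is paired with a sampled latent variable so that different draws yield different captions) can be illustrated with a minimal PyTorch-style sketch. The module and parameter names below (StochasticDecoderStep, latent_dim, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a stochastic decoder step in the spirit of MS-RNN:
# alongside the deterministic LSTM state, a latent Gaussian variable z is
# sampled (reparameterization trick) and injected into word prediction.
import torch
import torch.nn as nn

class StochasticDecoderStep(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, latent_dim, vocab_size):
        super().__init__()
        # M-LSTM-like cell: consumes the visual feature and previous word jointly
        self.cell = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        # Gaussian parameters of the latent variable, conditioned on h
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.classifier = nn.Linear(hidden_dim + latent_dim, vocab_size)

    def forward(self, visual_feat, word_embed, state):
        h, c = self.cell(torch.cat([visual_feat, word_embed], dim=-1), state)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization: different draws of eps yield different captions
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        logits = self.classifier(torch.cat([h, z], dim=-1))
        return logits, (h, c), (mu, logvar)  # (mu, logvar) feed a KL regularizer

# usage per decoding step: logits, state, stats = step(feat, emb, (h0, c0))
```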

    Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

    Full text link
    Recent progress has been made in using attention-based encoder-decoder frameworks for video captioning. However, most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). Non-visual words can be predicted easily by a natural language model without considering visual signals or attention, and imposing an attention mechanism on them can mislead the decoder and decrease the overall performance of video captioning. To address this issue, we propose a hierarchical LSTM with adjusted temporal attention (hLSTMat) approach for video captioning. Specifically, the proposed framework uses temporal attention to select specific frames for predicting the related words, while the adjusted temporal attention decides whether to rely on the visual information or on the language context information. In addition, hierarchical LSTMs are designed to simultaneously consider both low-level visual information and high-level language context information to support caption generation. To demonstrate the effectiveness of the proposed framework, we test our method on two prevalent datasets, MSVD and MSR-VTT; experimental results show that our approach outperforms the state-of-the-art methods on both datasets.
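    A minimal sketch of the mechanism the abstract describes: temporal attention scores the frames, and a scalar gate decides how much the current word should draw on visual context versus the language (hidden-state) context. Module and variable names are illustrative assumptions, not the hLSTMat code.

```python
# Temporal attention with an adjustment gate: alpha weights the frames,
# beta blends visual context against the language context for each word.
import torch
import torch.nn as nn

class AdjustedTemporalAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)   # frame relevance
        self.gate = nn.Linear(hidden_dim, 1)               # visual vs. language
        self.proj = nn.Linear(feat_dim, hidden_dim)        # match dimensions

    def forward(self, frame_feats, h):
        # frame_feats: (batch, n_frames, feat_dim); h: (batch, hidden_dim)
        n = frame_feats.size(1)
        h_exp = h.unsqueeze(1).expand(-1, n, -1)
        scores = self.score(torch.cat([frame_feats, h_exp], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)
        visual_ctx = (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)
        beta = torch.sigmoid(self.gate(h))   # expected to stay low for non-visual words
        ctx = beta * self.proj(visual_ctx) + (1.0 - beta) * h
        return ctx, alpha, beta              # ctx feeds the word classifier
```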

    Path Tracking of a Wheeled Mobile Manipulator through Improved Localization and Calibration

    Get PDF
    This chapter focuses on path tracking of a wheeled mobile manipulator designed for manufacturing processes such as drilling, riveting, or line drawing, which demand high accuracy. The problem is addressed by combining two approaches: improved localization and improved calibration. In the first approach, a full-scale kinematic equation is derived to calibrate each individual wheel's geometrical parameters, as opposed to the traditional practice of treating them as identical for all wheels. To avoid singularities in the computation, a predefined square path is used to quantify the calibration errors while accounting for movement in different directions. Both a statistical method and an interval analysis method are adopted and compared for estimating the calibration parameters. In the second approach, a vision-based deviation rectification solution is presented to localize the system in the global frame using a number of artificial reflectors identified by an onboard laser scanner. An improved tracking and localization algorithm is developed to meet the high positional accuracy requirement, improve the repeatability of the traditional trilateral algorithm, and solve the problem of pose loss during path following. The developed methods have been verified and implemented on the mobile manipulators developed at Shanghai University.
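    For orientation, the classical baseline behind reflector-based localization is plain least-squares trilateration from ranges to reflectors at known positions; the chapter's improved algorithm goes beyond this. The sketch below, with illustrative reflector coordinates and function names, only shows that baseline step.

```python
# Least-squares trilateration: estimate a 2-D position from ranges measured
# to reflectors at known global coordinates (>= 3 reflectors needed).
import numpy as np

def trilaterate(reflectors, ranges):
    """Return the (x, y) position that best fits the measured ranges."""
    reflectors = np.asarray(reflectors, dtype=float)   # shape (n, 2)
    ranges = np.asarray(ranges, dtype=float)           # shape (n,)
    # Subtracting the first circle equation from the others linearizes the
    # problem into A @ p = b, solvable by least squares.
    A = 2.0 * (reflectors[1:] - reflectors[0])
    b = (ranges[0] ** 2 - ranges[1:] ** 2
         + np.sum(reflectors[1:] ** 2, axis=1) - np.sum(reflectors[0] ** 2))
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos   # (x, y) in the global frame

# usage with three reflectors identified by the laser scanner (noisy ranges):
# trilaterate([(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)], [3.0, 4.0, 4.2])
```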

    A note on pentavalent s-transitive graphs

    Get PDF
    A graph with a group G of automorphisms is said to be (G,s)-transitive if G is transitive on the s-arcs but not on the (s+1)-arcs of the graph. Let X be a connected (G,s)-transitive graph for some s≥1, and let Gv be the stabilizer of a vertex v∈V(X) in G. In this paper, we determine the structure of Gv when X has valency 5 and Gv is non-solvable. Together with the results of Zhou and Feng [J.-X. Zhou, Y.-Q. Feng, On symmetric graphs of valency five, Discrete Math. 310 (2010) 1725–1732], this completely determines the structure of Gv when X has valency 5. For valency 3 or 4, the structure of Gv is already known.
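    For readers outside algebraic graph theory, the standard definitions the abstract relies on can be written out as follows; this only restates the abstract's terminology.

```latex
% s-arcs, (G,s)-transitivity, and the vertex stabilizer, as used above.
\begin{itemize}
  \item An \emph{$s$-arc} of a graph $X$ is a sequence $(v_0, v_1, \dots, v_s)$ of
        vertices such that $v_{i-1}$ and $v_i$ are adjacent for $1 \le i \le s$,
        and $v_{i-1} \neq v_{i+1}$ for $1 \le i \le s-1$.
  \item For $G \le \operatorname{Aut}(X)$, the graph $X$ is \emph{$(G,s)$-transitive}
        if $G$ acts transitively on the $s$-arcs of $X$ but not on its $(s+1)$-arcs.
  \item $G_v = \{\, g \in G : v^g = v \,\}$ denotes the stabilizer in $G$ of the
        vertex $v \in V(X)$.
\end{itemize}
```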

    DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning

    Full text link
    Compositional Zero-Shot Learning (CZSL) aims to recognize novel concepts composed of known knowledge without training samples. Standard CZSL methods either identify the visual primitives or enhance unseen composed entities, and as a result the entanglement between state and object primitives cannot be fully exploited. Admittedly, vision-language models (VLMs) can naturally cope with CZSL through prompt tuning, but uneven entanglement tends to drag the prompts into local optima. In this paper, we take a further step and introduce a novel Disentangled and Recurrent Prompt Tuning framework, termed DRPT, to better tap the potential of VLMs in CZSL. Specifically, the state and object primitives are treated as learnable vocabulary tokens embedded in the prompts and tuned on seen compositions. Instead of tuning state and object jointly, we devise a disentangled and recurrent tuning strategy that suppresses the traction force caused by entanglement and gradually optimizes the token parameters, leading to a better prompt space. Notably, we develop a progressive fine-tuning procedure that updates the prompts incrementally, optimizing the object first and then the state, and vice versa. Because the optimization of state and object is kept independent, clearer features can be learned, further alleviating the problem of entanglement misleading the optimization. Moreover, we quantify and analyze the entanglement in CZSL and provide complementary entanglement-rebalancing optimization schemes. DRPT surpasses representative state-of-the-art methods on extensive benchmark datasets, demonstrating superiority in both accuracy and efficiency.
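    A minimal sketch of the disentangled, alternating tuning idea the abstract describes: state and object prompt tokens are separate learnable embeddings, and each tuning phase updates only one group while the other stays frozen. Class, function, and parameter names are illustrative assumptions, not the DRPT implementation.

```python
# Separate learnable state/object prompt tokens plus a phase schedule that
# freezes one group while the other is optimized (then swaps).
import torch
import torch.nn as nn

class CompositionalPrompt(nn.Module):
    def __init__(self, n_states, n_objects, embed_dim):
        super().__init__()
        self.state_tokens = nn.Parameter(torch.randn(n_states, embed_dim) * 0.02)
        self.object_tokens = nn.Parameter(torch.randn(n_objects, embed_dim) * 0.02)

    def forward(self, state_idx, object_idx, context):
        # context: fixed prompt-context embeddings, shape (n_ctx, embed_dim)
        pair = torch.stack([self.state_tokens[state_idx],
                            self.object_tokens[object_idx]])
        return torch.cat([context, pair], dim=0)   # prompt fed to the text encoder

def recurrent_tuning_schedule(prompt, phases=("object", "state", "joint")):
    """Yield (phase, trainable parameters): each phase optimizes only one token group."""
    for phase in phases:
        prompt.state_tokens.requires_grad_(phase in ("state", "joint"))
        prompt.object_tokens.requires_grad_(phase in ("object", "joint"))
        params = [p for p in (prompt.state_tokens, prompt.object_tokens)
                  if p.requires_grad]
        yield phase, params   # build an optimizer over `params` for this phase
```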